# Libraries
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(patchwork))
sleep <- read.csv("cmu-sleep.csv")
It is important for students to succeed during their first year of college for numerous reasons. However, the transition to college life presents many challenges, which can often compromise a student’s sleep. Sleep is crucial to cognitive function, so a reduction in sleep could threaten a student’s ability to succeed in their first year of college. We hypothesize that better sleep could lead to a higher GPA, and suggest that university policy and student behavior can be adjusted accordingly to enhance academic outcomes.
The overarching aim is to discern the relationship between sleep habits and academic success during the pivotal first year of college. This exploration is structured into three distinct sections:

- Firstly, the connection between sleep duration and academic achievement is examined.
- Secondly, we consider whether sleep variability, as an indicator of sleep quality, might also correlate with GPA.
- Lastly, we identify if these relationships hold true when accounting for demographic factors like race, gender, and first-generation college status.
Our data was collected from a study on the CMU Data Repository that surveyed first-year students from Carnegie Mellon University (CMU), the University of Washington (UW), and the University of Notre Dame (ND). In total, 634 students participated in the survey. Students received a Fitbit to track their sleep and physical activity. Additionally, their GPAs were collected from their university’s registrar.
The dataset includes variables such as subject ID, study number, cohort, and demographic details, alongside sleep-related metrics like bedtime variability and total sleep time. The investigation focuses on the potential influence of sleep duration on academic performance, specifically changes in the end-of-semester grade point average (GPA) among first-year college students.
We have five categorical variables and ten quantitative variables, as displayed below. Some variables describe features of the student such as race and gender; some describe the student’s sleeping habits; and some describe the student’s academic performance.
| Descriptive Variable | Description |
|---|---|
| Subject ID | Unique ID of the Subject. |
| Study | Study Number (corresponding to last table). |
| Cohort | Codename of the cohort that the subject belongs to. |
| Race | Binary label for underrepresented and non-underrepresented students (underrepresented = 0, non-underrepresented = 1). |
| Gender | Gender of the subject (male = 0, female = 1), as reported by their institution. |
| First Generation | First-generation status (non-first gen = 0, first-gen = 1). |
| Sleep-Related Metric | Description |
|---|---|
| Bedtime MSSD | Mean successive squared difference of bedtime. This measures bedtime variability, and is calculated as the average of the squared difference of bedtime on consecutive nights. |
| Total Sleep Time | Average time in bed (the difference between wake time and bedtime) minus the length of total awake/restlessness in the main sleep episode, in minutes. |
| Midpoint Sleep | Average midpoint of bedtime and wake time, in minutes after 11 pm. |
| Fraction Nights with Data | Fraction of nights with captured data for the subject. |
| Daytime Sleep | Average sleep time outside of the range of the main sleep episode, in minutes. |
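
As a concrete reading of the Bedtime MSSD definition in the table above, the sketch below computes the average of the squared differences between consecutive nights. The helper name `mssd` and the bedtime values are made up for illustration; they are not taken from the study's code, and the units used for the dataset's `bedtime_mssd` column are not stated here.

```r
# Hypothetical helper: mean successive squared difference of a numeric vector.
mssd <- function(x) {
  mean(diff(x)^2)  # diff() gives the night-to-night changes
}

# Made-up bedtimes (hours after 11 pm) on three consecutive nights
bedtime <- c(0, 1, 3)

# Successive differences are 1 and 2, so the result is (1 + 4) / 2 = 2.5
mssd(bedtime)
```

A student with the same bedtime every night would score 0, and larger values indicate a more irregular schedule.
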
| Academic Performance Metric | Description |
|---|---|
| Cumulative GPA | Cumulative GPA (out of 4.0), for semesters before the one being studied. |
| Term GPA | End-of-term GPA (out of 4.0) for the semester being studied. |
| Term Units | Number of course units carried in the term. |
| Term Units (Adjusted) | Term Units adjusted for mean of 0 and standard deviation of 1. |
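
The adjusted term units are described as having mean 0 and standard deviation 1, which is the usual z-score transformation. A minimal sketch with made-up unit counts follows; whether the original study standardized within each cohort or across all students is not stated, so this shows only the generic calculation.

```r
# Made-up term unit counts for five students (illustrative only)
term_units <- c(36, 45, 48, 54, 57)

# z-score: subtract the mean, then divide by the standard deviation
z_units <- (term_units - mean(term_units)) / sd(term_units)

mean(z_units)  # zero, up to floating-point error
sd(z_units)    # one
```
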
| Study | University | Semester |
|---|---|---|
| 1 | Carnegie Mellon University | Spring 2018 |
| 2 | University of Washington | Spring 2018 |
| 3 | University of Washington | Spring 2019 |
| 4 | University of Notre Dame | Spring 2016 |
| 5 | Carnegie Mellon University | Spring 2017 |
First, the following code chunk generates the correlation matrix for continuous variables such as sleep times, GPA, and other numerical metrics to see how strongly these variables are related.
sleep_continuous <- sleep[c('bedtime_mssd', 'TotalSleepTime', 'midpoint_sleep',
'frac_nights_with_data', 'daytime_sleep', 'cum_gpa',
'term_gpa')]
cor_matrix <- cor(sleep_continuous, use = "complete.obs")
print(cor_matrix)
## bedtime_mssd TotalSleepTime midpoint_sleep
## bedtime_mssd 1.000000000 -0.1378871 0.41007395
## TotalSleepTime -0.137887141 1.0000000 -0.33204303
## midpoint_sleep 0.410073955 -0.3320430 1.00000000
## frac_nights_with_data -0.444754051 0.1151740 -0.29670431
## daytime_sleep 0.081458938 -0.2925153 0.08864347
## cum_gpa -0.006016101 0.1103745 -0.19142135
## term_gpa -0.035991253 0.2016771 -0.19454357
## frac_nights_with_data daytime_sleep cum_gpa
## bedtime_mssd -0.44475405 0.08145894 -0.006016101
## TotalSleepTime 0.11517399 -0.29251526 0.110374482
## midpoint_sleep -0.29670431 0.08864347 -0.191421349
## frac_nights_with_data 1.00000000 -0.06463782 0.044623099
## daytime_sleep -0.06463782 1.00000000 -0.143174723
## cum_gpa 0.04462310 -0.14317472 1.000000000
## term_gpa 0.07412054 -0.15302999 0.638035220
## term_gpa
## bedtime_mssd -0.03599125
## TotalSleepTime 0.20167715
## midpoint_sleep -0.19454357
## frac_nights_with_data 0.07412054
## daytime_sleep -0.15302999
## cum_gpa 0.63803522
## term_gpa 1.00000000
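
The `use = "complete.obs"` argument in the call above means that any row with a missing value in any of the selected columns is dropped before the correlations are computed. The sketch below illustrates this on a small made-up data frame (`toy` and its columns are hypothetical) by comparing the result with an explicit `na.omit()`.

```r
# Made-up data with missing values in two different rows
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 5, 8, 10),
  c = c(5, NA, 3, 2, 1)
)

# Listwise deletion handled by cor() itself
cor_complete <- cor(toy, use = "complete.obs")

# Same idea done by hand: drop incomplete rows first, then correlate
cor_manual <- cor(na.omit(toy))

all.equal(cor_complete, cor_manual)  # both use the same three complete rows
```

One consequence is that every entry of the matrix is based on the same subset of students, so columns with many missing values shrink the sample for all pairs.
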
In the following code chunk, we plot histograms and density plots to explore the distributions of two key variables, TotalSleepTime and term_gpa.
sleep_TotalSleepTime <- sleep %>%
ggplot(aes(x = TotalSleepTime)) +
labs(title = "Total Sleep Time Distribution",
y = "Density",
x = "Total Sleep Time (minutes)") +
geom_histogram(aes(y = after_stat(density)),
color = "deepskyblue4",
fill = "deepskyblue",
binwidth = 11.59) +
geom_density(fill = "deeppink",
alpha = 0.2)
sleep_term_GPA <- sleep %>%
ggplot(aes(x = term_gpa)) +
labs(title = "Term GPA Distribution",
y = "Density",
x = "Term GPA") +
geom_histogram(aes(y = after_stat(density)),
color = "darkorchid4",
fill = "darkorchid1",
binwidth = 0.1066) +
geom_density(fill = "dodgerblue",
alpha = 0.2)
sleep_TotalSleepTime + sleep_term_GPA
Sleep Time appears to be approximately normally distributed, centered around 400 minutes (roughly 6.7 hours). There are some outliers on both the lower and higher ends, but the bulk of the data falls close to the center. The corresponding density plot shows the same bell-shaped curve.
The histogram for Term GPA shows that the data skews left, indicating that more students have a GPA closer to 4.0 than to the lower end of the scale. There’s a clear peak around the GPA of 3.5. The density plot overlays the histogram, providing a smoothed curve representation of the distribution, emphasizing the skew towards higher GPAs.
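
The report does not say how the `binwidth` values in the histograms above were chosen. One common heuristic is the Freedman-Diaconis rule, which sets the bin width to twice the interquartile range divided by the cube root of the sample size. The sketch below is an assumption about a reasonable approach, not a reconstruction of the authors' method, and `fd_binwidth` is a hypothetical helper name.

```r
# Hypothetical helper implementing the Freedman-Diaconis rule
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]                 # ignore missing values
  2 * IQR(x) / length(x)^(1 / 3)    # 2 * IQR / n^(1/3)
}

# Toy check: for 1:8 the IQR is 3.5 and 8^(1/3) is 2, so the width is 3.5
fd_binwidth(1:8)
```

With the real data, a call such as `fd_binwidth(sleep$TotalSleepTime)` could then be passed to `geom_histogram(binwidth = ...)`.
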
sleep %>%
ggplot() +
geom_point(data = subset(sleep, daytime_sleep <= 77),
aes(x = midpoint_sleep, y = TotalSleepTime, color = daytime_sleep),
alpha = 0.85) +
scale_color_gradient2("Average \nDaytime Sleep \n(minutes)",
low = "red", mid = "orange", high = "blue",
midpoint = median(sleep$daytime_sleep)) +
geom_point(data = subset(sleep, daytime_sleep > 77),
aes(x = midpoint_sleep, y = TotalSleepTime),
alpha = 0.85) +
labs(title = "Total Sleep Time vs. Midpoint of Sleep",
x = "Midpoint of Sleep (minutes after 11pm)",
y = "Total Sleep Time (minutes)")
sleep %>%
ggplot(aes(x = bedtime_mssd, y = midpoint_sleep, color = Zterm_units_ZofZ)) +
scale_color_gradient2(low = "red", mid = "orange", high = "blue") +
geom_point(alpha = 0.95)
sleep %>%
ggplot(aes(x = frac_nights_with_data, y = cum_gpa, color = daytime_sleep)) +
scale_color_gradient2(low = "red", mid = "orange", high = "blue") +
geom_point(alpha = 0.5)
sleep$cat_race <- as.factor(sleep$demo_race)
sleep$cat_gender <- as.factor(sleep$demo_gender)
sleep$cat_firstgen <- as.factor(sleep$demo_firstgen)
sleep %>%
ggplot(aes(x = cat_gender, fill = cat_race)) +
facet_wrap(~ cat_firstgen) +
geom_bar(position = "dodge")
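
The factors created above keep the raw 0/1 codes as their levels, so plot legends show numbers rather than group names. A possible refinement, sketched here on a made-up vector, is to attach labels using the coding given in the variable table (male = 0, female = 1). The vector `demo_gender` below is illustrative and is not read from the dataset.

```r
# Made-up gender codes following the table's convention (male = 0, female = 1)
demo_gender <- c(0, 1, 1, 0)

# Attach readable labels while keeping the 0/1 ordering of the levels
cat_gender <- factor(demo_gender,
                     levels = c(0, 1),
                     labels = c("male", "female"))

table(cat_gender)  # counts per labeled group
```

The same pattern would apply to the race and first-generation columns, using the codings listed in the table.
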
library(ggridges)
sleep %>%
ggplot(aes(y = cat_gender, x = cum_gpa)) +
facet_grid( ~ cat_race) +
ggridges::geom_density_ridges(rel_min_height = 0.01,
alpha = 0.75,
aes(fill = cat_firstgen))
## Picking joint bandwidth of 0.232
## Picking joint bandwidth of 0.129
## Picking joint bandwidth of NaN
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
sleep %>%
ggplot(aes(x = daytime_sleep, y = term_gpa, color = Zterm_units_ZofZ)) +
geom_point(alpha = 0.75) +
scale_color_gradient2("Term Units \n(Z-Score)",
limit = c(-2, 2),
low = "red", mid = "green", high = "blue",
na.value = rgb(1, 0.96, 0.83, alpha = 0.001)) +
labs(title = "Term GPA vs. Daytime Sleep",
x = "Daytime Sleep (minutes)",
y = "Term GPA") +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
sleep %>%
ggplot(aes(x = TotalSleepTime, y = term_gpa, color = midpoint_sleep)) +
scale_color_gradient(low = "orange", high = "blue") +
geom_point(alpha = 0.5)
sleep %>%
ggplot(aes(x = Zterm_units_ZofZ, y = term_gpa, color = TotalSleepTime)) +
geom_point(alpha = 0.85) +
labs(title = "Term GPA vs. Term Units (Z-Score)",
x = "Term Units (Z-Score)",
y = "Term GPA") +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 147 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 147 rows containing missing values (`geom_point()`).
sleep <- sleep %>%
subset(daytime_sleep < 250)
summary(lm(cum_gpa ~ daytime_sleep, data = sleep))
##
## Call:
## lm(formula = cum_gpa ~ daytime_sleep, data = sleep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2921 -0.2272 0.0860 0.3061 0.7084
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.561064 0.032507 109.547 < 2e-16 ***
## daytime_sleep -0.002322 0.000676 -3.434 0.000633 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4337 on 631 degrees of freedom
## Multiple R-squared: 0.01835, Adjusted R-squared: 0.01679
## F-statistic: 11.8 on 1 and 631 DF, p-value: 0.0006328
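
To make the regression output above easier to read, the sketch below rescales the reported slope from GPA points per minute to GPA points per hour of average daytime sleep and computes the fitted GPA at 60 minutes. The numbers are copied from the printed summary; the variable names are illustrative. This describes an association only and does not by itself show that daytime sleep lowers GPA.

```r
# Coefficients copied from the summary() output above
intercept        <- 3.561064
slope_per_minute <- -0.002322
r_squared        <- 0.01835

# Change in fitted cumulative GPA per additional hour of average daytime sleep
slope_per_hour <- slope_per_minute * 60

# Fitted cumulative GPA for a student averaging 60 minutes of daytime sleep
fitted_gpa_60 <- intercept + slope_per_minute * 60

slope_per_hour    # about -0.14 GPA points per hour
fitted_gpa_60     # about 3.42
100 * r_squared   # percent of GPA variance explained, under 2%
```

So although the slope is statistically significant, daytime sleep alone accounts for less than 2% of the variation in cumulative GPA.
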
filter(sleep, cohort == "nh")$Zterm_units_ZofZ %>%
is.na() %>%
sum()
## [1] 147
sleep_quant <- sleep %>%
filter(!(cohort == "nh")) %>%
select(!c(subject_id, study, cohort,
demo_race, demo_gender, demo_firstgen,
cat_race, cat_gender, cat_firstgen))
sleep_pca <- prcomp(sleep_quant, center = TRUE, scale. = TRUE)
summary(sleep_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.4769 1.2763 1.1630 1.0107 0.93478 0.79195 0.73231
## Proportion of Variance 0.2424 0.1810 0.1503 0.1135 0.09709 0.06969 0.05959
## Cumulative Proportion 0.2424 0.4233 0.5736 0.6871 0.78423 0.85392 0.91350
## PC8 PC9
## Standard deviation 0.64686 0.60005
## Proportion of Variance 0.04649 0.04001
## Cumulative Proportion 0.95999 1.00000
suppressPackageStartupMessages(library(factoextra))
fviz_eig(sleep_pca, choice = "variance", ncp = 9, addlabels = TRUE) +
geom_hline(yintercept = 100 * (1 / ncol(sleep_quant)))
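
The horizontal line in the scree plot sits at 100 / 9, or about 11.1%, which is the share each component would explain if all nine carried equal variance. The sketch below recomputes the proportions from the standard deviations printed in the PCA summary and counts how many components exceed that line. The values are copied from the output above; the variable names are illustrative.

```r
# Standard deviations of PC1..PC9, copied from summary(sleep_pca)
sdev <- c(1.4769, 1.2763, 1.1630, 1.0107, 0.93478,
          0.79195, 0.73231, 0.64686, 0.60005)

# Proportion of variance explained by each component
prop_var <- sdev^2 / sum(sdev^2)
round(prop_var, 4)

# Equal-share threshold drawn by geom_hline(), as a proportion
threshold <- 1 / length(sdev)

# Number of components above the threshold: the first four
sum(prop_var > threshold)

# Cumulative share of those four components, about 69%
cumsum(prop_var)[4]
```
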
sleep_nh <- sleep %>%
filter(!(cohort == "nh"))
sleep_nh <- sleep_nh %>%
mutate(pc1 = sleep_pca$x[,1],
pc2 = sleep_pca$x[,2],
pc3 = sleep_pca$x[,3])
sleep_nh %>%
ggplot(aes(x = pc1, y = pc2)) +
labs(title = "PCA Plot for Sleep",
x = "PC 1",
y = "PC 2") +
geom_point(aes(color = as.factor(study)), alpha = 0.75)
fviz_pca_biplot(sleep_pca, label = "var",
alpha.ind = 0.25,
alpha.var = 0.75,
repel = TRUE,
# Color the points by cohort:
habillage = sleep_nh$cohort, pointshape = 19)
# Standardize each variable by its standard deviation
sleep_std <- sleep_quant %>%
  scale(center = FALSE,
        scale = apply(sleep_quant, 2, sd, na.rm = TRUE))
# Compute the Euclidean distance matrix between students
sleep_dist <- dist(sleep_std, method = "euclidean")
plot(as.dendrogram(hclust(sleep_dist, method = "single")))
plot(as.dendrogram(hclust(sleep_dist, method = "complete")))
suppressPackageStartupMessages(library(dendextend))
sleep_complete_dend <- as.dendrogram(hclust(sleep_dist, method = "complete"))
sleep_complete_dend <- set(sleep_complete_dend, "branches_k_color", k=5)
plot(sleep_complete_dend)
# Use sleep_nh so the colors line up with the rows that were clustered
sleep_colors <- ifelse(sleep_nh$study == 5, "red",
                ifelse(sleep_nh$study == 4, "orange",
                ifelse(sleep_nh$study == 3, "gold",
                ifelse(sleep_nh$study == 2, "green", "blue"))))
plot(set(sleep_complete_dend, "labels_colors",
order_value = TRUE, sleep_colors))
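
Coloring the dendrogram shows the clusters visually, but the memberships can also be extracted and compared with the study labels in a table. The sketch below does this on a tiny made-up dataset using `hclust()` and `cutree()` from base R; the vectors `x` and `study` are hypothetical and stand in for the standardized sleep data and the `study` column.

```r
# Made-up one-dimensional data with two well-separated groups
x <- c(0, 0.1, 0.2, 10, 10.1, 10.2)
study <- c(1, 1, 1, 2, 2, 2)  # hypothetical study labels

# Complete-linkage clustering on Euclidean distances
toy_hc <- hclust(dist(x), method = "complete")

# Cut the tree into k = 2 clusters and get one label per observation
clusters <- cutree(toy_hc, k = 2)

# Cross-tabulate cluster membership against the study labels
table(cluster = clusters, study = study)
```

Applied to the analysis above, `cutree(hclust(sleep_dist, method = "complete"), k = 5)` tabulated against the study numbers of the clustered rows would show whether the five clusters track the five studies.
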